Abstract

The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.


1 Introduction 


The variational approach to inducing point selection of Titsias (2009a) has been highly influential in the active research area of scalable Gaussian process approximations. The chief advantage of this particular framework is that the inducing point positions are variational parameters rather than model parameters and as such are protected from overfitting. In this paper we argue that whilst this is true, it may not be for exactly the reasons previously thought. The original framework is applied to conjugate likelihoods and has been extended to non-conjugate likelihoods (Chai, 2012; Hensman et al., 2015). An important advance in the use of variational methods was their combination with stochastic gradient descent (Hoffman et al., 2013) and the variational inducing point framework has been combined with such methods in the conjugate (Hensman et al., 2013) and non-conjugate cases (Hensman et al., 2015). The approach has also been successfully used to perform scalable inference in more complex models such as the Gaussian process latent variable model (Titsias and Lawrence, 2010; Damianou et al., 2014) and the related Deep Gaussian process (Damianou and Lawrence, 2012; Hensman and Lawrence, 2014).


To be more concrete let us set up some notation. Consider a function $f$ mapping an index set $X$ to the set of real numbers, $f : X \to \mathbb{R}$. Entirely equivalently we may write $f \in \mathbb{R}^X$ or use sequence notation $(f(x))_{x \in X}$. We also define set indexing of the function. If $S \subseteq X$ is some subset of the index set, then $f_S := (f(x))_{x \in S}$, and we may straightforwardly extend this definition to single elements of the index set, $f_x := f_{\{x\}}$. We can put this notation to immediate use by defining a subset $D \subseteq X$ of the index set, of size $N$, that corresponds to those input points for which we have observed data. The corresponding function values will then be denoted $f_D$. For simplicity, we will initially assume that we have one, possibly noisy, possibly non-conjugate observation $y$ per input data point, which will together form a set $Y$.


Gaussian processes allow us to define a prior over functions $f$. After we observe the data we will have some posterior which we wish to approximate with a sparse distribution. At the heart of the variational inducing point approximation is the idea of ‘augmentation’ that appears in the original paper and many subsequent ones. We choose to monitor a set $Z \subseteq X$ of size $M$. These points may have some overlap with the input data points $D$ but to give a computational speed up $M$ will need to be less than the number of data points $N$. The Kullback-Leibler divergence given as an optimization criterion in Titsias’ original paper is

$$\mathcal{KL}[q(f_{D \setminus Z}, f_Z) \,\|\, p(f_{D \setminus Z}, f_Z \mid Y)] = \int q(f_{D \setminus Z}, f_Z) \log \frac{q(f_{D \setminus Z}, f_Z)}{p(f_{D \setminus Z}, f_Z \mid Y)} \,\mathrm{d}f_{D \setminus Z}\,\mathrm{d}f_Z. \tag{1}$$

The variational distribution at those data points which are not also inducing points is taken to have the form:

$$q(f_{D \setminus Z}, f_Z) := p(f_{D \setminus Z} \mid f_Z)\, q(f_Z) \tag{2}$$

where $p(f_{D \setminus Z} \mid f_Z)$ is the prior conditional and $q(f_Z)$ is a variational distribution on the inducing points only. Under this factorization, for a conjugate likelihood, the optimal $q(f_Z)$ has an analytic Gaussian solution (Titsias, 2009a). The non-conjugate case was then studied in subsequent work (Chai, 2012; Hensman et al., 2015). In both cases the sparse approximation requires only $O(NM^2)$ computation rather than the $O(N^3)$ required by exact methods in the conjugate case, or by many commonly used non-conjugate approximations that don’t assume sparsity.

The augmentation is justified by arguing that the model remains marginally the same when the inducing points are added. It is therefore suggested that variational inference in the augmented model, including for the parameters of said augmentation, is equivalent to variational inference in the original model, i.e. that the inducing point positions can be considered to be variational parameters and are consequently protected from overfitting. For example see Titsias’ original conference paper (Titsias, 2009a), section 3, or the longer technical report version (Titsias, 2009b), section 3.1. In the common case in the literature where the argument proceeds by applying Jensen’s inequality to the marginal likelihood as, for example, in Hensman et al. (2015) equations (6) and (17), the slack of the bound on the marginal likelihood is precisely the $\mathcal{KL}$-divergence (1). Therefore maximizing such a bound is exactly equivalent to minimizing this objective and the considerations that follow all apply.
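The statement that the slack of a Jensen bound equals a $\mathcal{KL}$-divergence to the posterior can be checked numerically. The following is a minimal sketch on a hypothetical three-state latent variable; the probability tables are illustrative assumptions of ours and do not come from any of the cited papers.

```python
import math

# Hypothetical three-state latent variable; all numbers are illustrative.
prior = [0.5, 0.3, 0.2]        # p(f)
likelihood = [0.9, 0.4, 0.1]   # p(Y | f) for the observed Y
q = [0.6, 0.3, 0.1]            # an arbitrary variational distribution

evidence = sum(p * l for p, l in zip(prior, likelihood))            # p(Y)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]   # p(f | Y)

# Jensen lower bound on log p(Y): E_q[log p(Y | f) + log p(f) - log q(f)].
elbo = sum(qi * (math.log(l) + math.log(p) - math.log(qi))
           for qi, p, l in zip(q, prior, likelihood))

# KL[q || p(. | Y)], computed directly from the posterior.
kl_q_posterior = sum(qi * math.log(qi / po) for qi, po in zip(q, posterior))

# Slack of the bound on the log marginal likelihood.
slack = math.log(evidence) - elbo
```

Under these assumptions `slack` and `kl_q_posterior` agree up to floating point error, so maximizing the bound over `q` is the same as minimizing the divergence.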


In fact in this paper, whilst we applaud the excellent prior work, we will show that variational inference in an augmented model is not in general equivalent to variational inference in the original model. Without this justification, the $\mathcal{KL}$-divergence in equation (1) could seem to be a strange optimization target. The $\mathcal{KL}$-divergence has the inducing variables on both sides, so it might seem that in optimizing the inducing point positions we are trying to hit a ‘moving target’. It is desirable to rigorously formulate a ‘one sided’ $\mathcal{KL}$-divergence that leads to Titsias’ formulation. Such a derivation could be viewed as putting these elegant and popular methods on a firmer foundation. Such a derivation is the topic of this article. As we shall show this cements the framework for sparse interdomain inducing approximations and sparse variational inference in Cox processes. We wish to re-emphasize our respect for the previous work and, for the avoidance of suspense, we will find that much of the existing work carries over mutatis mutandis. Nevertheless we feel that most readers at the end of the paper will agree that a precise treatment of the topic should be of benefit going forward.


In terms of prior work for the theoretical aspect, the major other references are the early work of Seeger (2003a; 2003b). In particular Seeger identifies the $\mathcal{KL}$-divergence between processes (more commonly referred to as a relative entropy in those texts) as a measure of similarity and applies it to PAC-Bayes and to subset of data sparse methods. Crucially, Seeger outlines the rigorous formulation of such a $\mathcal{KL}$-divergence which is a large technical obstacle. Here we give a shorter, more general, and intuitive proof of the key theorem. We extend the stochastic process formulation to inducing points which are not necessarily selected from the data and show that this is equivalent to Titsias’ formulation. In so far as we are aware this relationship has not previously been noted in the literature. The idea of using the $\mathcal{KL}$-divergence between processes is also mentioned in the early work of Csato and Opper (2002) but the transition from finite dimensional multivariate Gaussians to infinite dimensional Gaussian processes is not covered at the level of detail discussed here. An optimization target that in intent seems to be similar to a $\mathcal{KL}$-divergence between stochastic processes is briefly mentioned in the work of Alvarez (2011). The notation used suggests that the integration is with respect to an ‘infinite dimensional Lebesgue measure’, which as we shall see is an argument that arrives at the right answer via a mathematically flawed route. Chai (2012) seems to have been at least partly aware of Seeger’s $\mathcal{KL}$-divergence theorems (Seeger, 2003) but instead uses them to bound the finite joint predictive probability of a non sparse process.


This article proceeds by first discussing the finite dimensional version of the full argument. This requires considerably less mathematical machinery and much of the intuition can be gained from this case. We then proceed to give the full measure theoretic formulation, giving a new proof that allows inducing points that are not data points and for the likelihood to depend on infinitely many function values. Next we discuss augmentation of the original index set, using the crucial chain rule for $\mathcal{KL}$-divergences. This gives us a framework to discuss marginal consistency and how variational inference in augmented models is not necessarily equivalent to variational inference in the original model. We then show that under very general conditions augmentation which is deterministic conditioned on the whole latent function does have the desired property. We apply our results to sparse variational interdomain approximations and to posterior inference in Cox processes. Finally we conclude and highlight avenues for further research.


2 Finite index set case 

This section is in fact a less general case of what follows. It is included for the benefit of those familiar with the previous work on variational sparse approximations and as an important special case. Consider the case where $X$ is finite. We introduce a new set $* := X \setminus (D \cup Z)$, in words: all points in the index set that are neither inducing points nor data points. These points might be of practical interest, for instance when making predictions on hold out data.

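We extend the variational distribution to include these points:

$$q(f_*, f_{D \setminus Z}, f_Z) := p(f_*, f_{D \setminus Z} \mid f_Z)\, q(f_Z). \tag{3}$$

We then consider the $\mathcal{KL}$-divergence between this extended variational distribution and the full posterior distribution $p(f \mid Y)$:

$$\mathcal{KL}[q(f_*, f_{D \setminus Z}, f_Z) \,\|\, p(f \mid Y)] = \mathcal{KL}[q(f_*, f_{D \setminus Z}, f_Z) \,\|\, p(f_*, f_{D \setminus Z}, f_Z \mid Y)] = \int p(f_*, f_{D \setminus Z} \mid f_Z)\, q(f_Z) \log \frac{q(f_*, f_{D \setminus Z}, f_Z)}{p(f_*, f_{D \setminus Z}, f_Z \mid Y)} \,\mathrm{d}f_*\,\mathrm{d}f_{D \setminus Z}\,\mathrm{d}f_Z. \tag{4}$$

Next we expand the term inside the logarithm and cancel one of the terms that appears in both the numerator and the denominator:

$$\frac{q(f_*, f_{D \setminus Z}, f_Z)}{p(f_*, f_{D \setminus Z}, f_Z \mid Y)} = \frac{p(f_* \mid f_{D \setminus Z}, f_Z)\, p(f_{D \setminus Z} \mid f_Z)\, q(f_Z)\, p(Y)}{p(f_* \mid f_{D \setminus Z}, f_Z)\, p(f_{D \setminus Z} \mid f_Z)\, p(f_Z)\, p(Y \mid f_D)} = \frac{p(f_{D \setminus Z} \mid f_Z)\, q(f_Z)\, p(Y)}{p(f_{D \setminus Z} \mid f_Z)\, p(f_Z)\, p(Y \mid f_D)} = \frac{q(f_{D \setminus Z}, f_Z)}{p(f_{D \setminus Z}, f_Z \mid Y)}. \tag{5}$$

Substituting back into the full integral and exploiting the marginalization property of the conditional density we obtain:

$$\int p(f_*, f_{D \setminus Z} \mid f_Z)\, q(f_Z) \log \frac{q(f_{D \setminus Z}, f_Z)}{p(f_{D \setminus Z}, f_Z \mid Y)} \,\mathrm{d}f_*\,\mathrm{d}f_{D \setminus Z}\,\mathrm{d}f_Z = \int p(f_{D \setminus Z} \mid f_Z)\, q(f_Z) \log \frac{q(f_{D \setminus Z}, f_Z)}{p(f_{D \setminus Z}, f_Z \mid Y)} \,\mathrm{d}f_{D \setminus Z}\,\mathrm{d}f_Z. \tag{6}$$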


The last line is exactly the $\mathcal{KL}$-divergence used by Titsias (2009a) that we already described in equation (1). We thus see that for finite index sets considering the $\mathcal{KL}$-divergence between the two full distributions is equivalent to Titsias’ $\mathcal{KL}$-divergence. We might choose to optimize our choice of the $M$ inducing points by selecting them from the $|X|$ possible values in the index set and comparing the $\mathcal{KL}$-divergence between distributions given in equation (4). The equivalence with equation (1) that we have just derived shows us that in this case the appearance of the inducing values on both sides of the equation is just a question of ‘accounting’. That is to say, whilst we are in fact optimizing the $\mathcal{KL}$-divergence between the full distributions, we only need to keep track of the distribution over function values $f_Z$ and $f_{D \setminus Z}$. All the other function values $f_*$ marginalize. For different choices of inducing points we will need to keep track of different function values and be able to safely ignore different values $f_*$.
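The finite index set argument can be checked on a toy model. The sketch below assumes a hypothetical index set with three points (one inducing point, one data point that is not an inducing point, and one other point), each carrying a binary function value; the prior weights, likelihood and $q(f_Z)$ are arbitrary illustrative choices of ours.

```python
import itertools
import math

states = (0, 1)
configs = list(itertools.product(states, repeat=3))  # (f_z, f_d, f_star)

# Hypothetical unnormalized prior weights, one per configuration.
weights = [3.0, 1.0, 2.0, 2.0, 1.0, 4.0, 1.0, 2.0]
total = sum(weights)
prior = {c: w / total for c, w in zip(configs, weights)}

lik = {0: 0.2, 1: 0.7}     # p(Y | f_d): depends on the data point only
q_z = {0: 0.35, 1: 0.65}   # variational distribution on the inducing value


def marginal(dist, keep):
    """Sum a joint table down to the coordinates listed in `keep`."""
    out = {}
    for c, p in dist.items():
        key = tuple(c[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def kl(a, b):
    """KL[a || b] for two tables defined on the same keys."""
    return sum(pa * math.log(pa / b[k]) for k, pa in a.items() if pa > 0)


prior_z = marginal(prior, [0])
evidence = sum(p * lik[c[1]] for c, p in prior.items())
posterior = {c: p * lik[c[1]] / evidence for c, p in prior.items()}

# Equation (3): q(f) = p(f_star, f_d | f_z) q(f_z).
q_full = {c: p / prior_z[(c[0],)] * q_z[c[0]] for c, p in prior.items()}

# Divergence between the full distributions, as in equation (4).
kl_full = kl(q_full, posterior)
# Divergence on (f_z, f_d) only, as in equation (1).
kl_marginal = kl(marginal(q_full, [0, 1]), marginal(posterior, [0, 1]))
```

In this sketch `kl_full` and `kl_marginal` coincide up to rounding, which is the ‘accounting’ point made above: the value at the third point marginalizes out of the divergence.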


3 Infinite index set case 


3.1 There is no useful infinite dimensional 
Lebesgue measure 


One might hope to cope with not only finite index sets but also infinite index sets in the way discussed in section 2. Unfortunately when $X$, and hence $f_*$, is infinite we cannot integrate with respect to an ‘infinite dimensional vector’. That is to say the notation $\int (\cdot)\,\mathrm{d}f_*$ can no longer be correctly used.

For a discussion of this see, for example, Hunt et al. (1992). The crux of the issue is that to give sensible answers such a measure would need to be translation invariant and locally finite. Unfortunately the only measure that obeys these two properties is the zero measure which assigns zero to every input set.


Thus we see that it will be necessary to rethink our approach to a $\mathcal{KL}$-divergence between stochastic processes. It will turn out that a reasonable definition will require the full apparatus of measure theory. Readers looking for some background on these issues may wish to consult a larger text (Billingsley, 1995; Capinski and Kopp, 2004).


3.2 The $\mathcal{KL}$-divergence between processes

In this section we review the rigorous definition of the $\mathcal{KL}$-divergence between stochastic processes (Gray, 2011).

Suppose we have two measures $\mu$ and $\eta$ for $(\Omega, \Sigma)$ and that $\mu$ is absolutely continuous with respect to $\eta$. Then there exists a Radon-Nikodym derivative $\frac{\mathrm{d}\mu}{\mathrm{d}\eta}$ and the correct definition for the $\mathcal{KL}$-divergence between these measures is:

$$\mathcal{KL}[\mu \,\|\, \eta] = \int_\Omega \log\left\{\frac{\mathrm{d}\mu}{\mathrm{d}\eta}\right\} \mathrm{d}\mu. \tag{7}$$

In the case where $\mu$ is not absolutely continuous with respect to $\eta$ we let $\mathcal{KL}[\mu \,\|\, \eta] = \infty$. In the case where the sample space is $\mathbb{R}^K$ for some finite $K$ and both measures are dominated by Lebesgue measure $m$ this reduces to the more familiar definition:

$$\mathcal{KL}[\mu \,\|\, \eta] = \int_\Omega u \log\left\{\frac{u}{v}\right\} \mathrm{d}m \tag{8}$$

where $u$ and $v$ are the respective densities with respect to Lebesgue measure. The first definition is more general and allows us to deal with the problem of there being no sensible infinite dimensional Lebesgue measure by instead integrating with respect to the measure $\mu$.
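On a finite sample space the Radon-Nikodym derivative is simply a ratio of point masses, which gives a small sanity check of the definition, including the convention for the case where absolute continuity fails. The following is a sketch under that finiteness assumption; the function name `kl_divergence` and the example measures are ours.

```python
import math


def kl_divergence(mu, eta):
    """KL[mu || eta] for measures on a finite sample space, as in equation (7).

    `mu` and `eta` map outcomes to probabilities. On a finite space the
    Radon-Nikodym derivative d(mu)/d(eta) is the ratio of point masses.
    """
    total = 0.0
    for outcome, mass in mu.items():
        if mass == 0.0:
            continue  # null sets of mu contribute nothing to the integral
        eta_mass = eta.get(outcome, 0.0)
        if eta_mass == 0.0:
            return math.inf  # mu is not absolutely continuous w.r.t. eta
        total += mass * math.log(mass / eta_mass)
    return total


mu = {"a": 0.5, "b": 0.5, "c": 0.0}
eta = {"a": 0.25, "b": 0.25, "c": 0.5}

kl_mu_eta = kl_divergence(mu, eta)  # finite, since mu << eta
kl_eta_mu = kl_divergence(eta, mu)  # infinite, since eta charges "c" but mu does not
```

Note that the integral is taken with respect to $\mu$ itself, so no reference measure on the sample space is needed; this is the feature that carries over to the infinite dimensional setting.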


3.3 A general derivation of the sparse inducing 
point framework 

In this section we give a general derivation of the sparse inducing point framework. The derivation is more general than that of Seeger (2003a; 2003b) since it does not require that the inducing points are selected from the data points. Nor does it assume that the relevant finite dimensional marginal distributions have density with respect to Lebesgue measure. Finally since the dependence on the elegant properties of Radon-Nikodym derivatives has been made more explicit we believe it is clearer why the derivation works and how one would generalize it.


We are now interested in three types of probability measure on sets of functions $f : X \to \mathbb{R}$. The first is the prior measure $P$ which will be assumed to be a Gaussian process. The second is the approximating measure $Q$ which will be assumed to be a sparse Gaussian process and the third is the posterior process $\hat{P}$ which may be Gaussian or non-Gaussian depending on whether we have a conjugate likelihood. We start with a measure theoretic definition of Bayes’ theorem for a dominated model (Schervish, 1995). It specifies the Radon-Nikodym derivative of the posterior with respect to the prior:

$$\frac{\mathrm{d}\hat{P}}{\mathrm{d}P}(f) = \frac{L(Y \mid f)}{L(Y)} \tag{9}$$

with $L(Y \mid f)$ being the likelihood and $L(Y) = \int_{\mathbb{R}^X} L(Y \mid f)\,\mathrm{d}P(f)$ the marginal likelihood. As we have assumed in previous sections we will initially restrict the likelihood to only depend on the finite data subset of the index set. We denote by $\pi_C : \mathbb{R}^X \to \mathbb{R}^C$ a projection function, which takes the whole function as an argument and returns the function at some set of points $C$. In this case we have:

$$\frac{\mathrm{d}\hat{P}}{\mathrm{d}P}(f) = \frac{\mathrm{d}\hat{P}_D}{\mathrm{d}P_D}(\pi_D(f)) = \frac{L(Y \mid \pi_D(f))}{L(Y)} \tag{10}$$


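and similarly the marginal likelihood only depends on the function values on the data set, $L(Y) = \int_{\mathbb{R}^D} L(Y \mid f_D)\,\mathrm{d}P_D(f_D)$. In fact, we will relax the assumption that the data set is finite in section 5.2 and the ability to do so is one of the benefits of this framework. Next we specify $Q$ by assuming it has density with respect to the posterior and thus the prior and that the density with respect to the prior depends on some set of points $Z$:

$$\frac{\mathrm{d}Q}{\mathrm{d}P}(f) = \frac{\mathrm{d}Q_Z}{\mathrm{d}P_Z}(\pi_Z(f)). \tag{11}$$

Under this assumption $Q$ is fully specified if we know $P$ and $\frac{\mathrm{d}Q_Z}{\mathrm{d}P_Z}$. To gain some intuition for this assumption we can compare equations (10) and (11). We see that in the approximating distribution the set $Z$ is playing a similar role to that played by $D$ in the true posterior distribution. We now bring these assumptions together. Let us apply the chain rule for Radon-Nikodym derivatives and a standard property of logarithms:

$$\mathcal{KL}[Q \,\|\, \hat{P}] = \int_\Omega \log\left\{\frac{\mathrm{d}Q}{\mathrm{d}\hat{P}}(f)\right\} \mathrm{d}Q(f) = \int_\Omega \log\left\{\frac{\mathrm{d}Q}{\mathrm{d}P}(f)\right\} \mathrm{d}Q(f) - \int_\Omega \log\left\{\frac{\mathrm{d}\hat{P}}{\mathrm{d}P}(f)\right\} \mathrm{d}Q(f). \tag{12}$$

Taking the first term alone we exploit the sparsity assumption for the approximating distribution:

$$\int_\Omega \log\left\{\frac{\mathrm{d}Q}{\mathrm{d}P}(f)\right\} \mathrm{d}Q(f) = \int_\Omega \log\left\{\frac{\mathrm{d}Q_Z}{\mathrm{d}P_Z}(\pi_Z(f))\right\} \mathrm{d}Q(f) = \int_{\mathbb{R}^Z} \log\left\{\frac{\mathrm{d}Q_Z}{\mathrm{d}P_Z}(f_Z)\right\} \mathrm{d}Q_Z(f_Z). \tag{13}$$

Taking the second term in the last line of equation (12) and exploiting the measure theoretic Bayes’ theorem we obtain:

$$\int_\Omega \log\left\{\frac{\mathrm{d}\hat{P}}{\mathrm{d}P}(f)\right\} \mathrm{d}Q(f) = \int_{\mathbb{R}^D} \log\left\{\frac{\mathrm{d}\hat{P}_D}{\mathrm{d}P_D}(f_D)\right\} \mathrm{d}Q_D(f_D) = \mathbb{E}_{Q_D}[\log L(Y \mid f_D)] - \log L(Y). \tag{14}$$

Finally noting the appearance of a marginal $\mathcal{KL}$-divergence we obtain our result:

$$\mathcal{KL}[Q \,\|\, \hat{P}] = \mathcal{KL}[Q_Z \,\|\, P_Z] - \mathbb{E}_{Q_D}[\log L(Y \mid f_D)] + \log L(Y). \tag{15}$$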

As is common with variational approximations, in most cases of interest the marginal likelihood will be intractable. However since it is an additive constant, independent of $Q$, it can be safely ignored. The final equation shows that we need to be able to compute the $\mathcal{KL}$-divergence between the inducing point marginals of the approximating distribution and the prior for all $Z \subseteq X$ and the expectation under the data marginal distribution of $Q$ of the log likelihood. In the case where the likelihood factorizes across data terms this will give a sum of one dimensional expectations. Note the similarity of equation (15) with Hensman et al. (2015) equation (17) where a less general expression is motivated from a ‘model augmentation’ view. Notice that at no point in our derivation did we try to invoke the pathological ‘infinite dimensional Lebesgue measure’ which is important for the reasons discussed in section 3.1. The ease of derivation suggests that Radon-Nikodym derivatives and measure theory provide the most natural and general way to think about such approximations.
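Equation (15) can be checked in closed form in the smallest non-trivial conjugate setting. The sketch below assumes one inducing point and one distinct data point, a zero mean Gaussian prior on the two function values, and Gaussian observation noise; every numerical value, and the helper `kl_gauss_2d`, is a hypothetical choice of ours for illustration.

```python
import math


def kl_gauss_2d(mu0, cov0, mu1, cov1):
    """KL[N(mu0, cov0) || N(mu1, cov1)] for 2-dimensional Gaussians."""
    (a0, b0), (_, d0) = cov0
    (a1, b1), (_, d1) = cov1
    det0 = a0 * d0 - b0 * b0
    det1 = a1 * d1 - b1 * b1
    # Inverse of the symmetric 2x2 matrix cov1.
    inv = ((d1 / det1, -b1 / det1), (-b1 / det1, a1 / det1))
    trace = inv[0][0] * a0 + 2.0 * inv[0][1] * b0 + inv[1][1] * d0
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    maha = inv[0][0] * dx * dx + 2.0 * inv[0][1] * dx * dy + inv[1][1] * dy * dy
    return 0.5 * (trace + maha - 2.0 + math.log(det1 / det0))


# Hypothetical prior covariance entries, noise variance and observation.
k_zz, k_zd, k_dd = 1.0, 0.6, 1.0
noise_var, y = 0.25, 0.8
# Hypothetical variational distribution q(f_Z) = N(m, v).
m, v = 0.3, 0.5

# Prior conditional p(f_D | f_Z) = N(a * f_Z, c).
a = k_zd / k_zz
c = k_dd - k_zd ** 2 / k_zz

# Approximating joint over (f_Z, f_D), built as in equation (2).
q_mean = (m, a * m)
q_cov = ((v, a * v), (a * v, a * a * v + c))

# Exact posterior over (f_Z, f_D) by Gaussian conditioning on y.
g = k_dd + noise_var
post_mean = (k_zd * y / g, k_dd * y / g)
post_cov = ((k_zz - k_zd ** 2 / g, k_zd - k_zd * k_dd / g),
            (k_zd - k_zd * k_dd / g, k_dd - k_dd ** 2 / g))

# Left hand side of equation (15): divergence to the posterior.
lhs = kl_gauss_2d(q_mean, q_cov, post_mean, post_cov)

# Right hand side of equation (15), term by term.
kl_z = 0.5 * (v / k_zz + m * m / k_zz - 1.0 + math.log(k_zz / v))
qd_mean, qd_var = a * m, a * a * v + c
expected_log_lik = (-0.5 * math.log(2.0 * math.pi * noise_var)
                    - ((y - qd_mean) ** 2 + qd_var) / (2.0 * noise_var))
log_evidence = -0.5 * math.log(2.0 * math.pi * g) - y * y / (2.0 * g)
rhs = kl_z - expected_log_lik + log_evidence
```

Because all other function values marginalize, the divergence between the processes reduces here to the two function values that are tracked, and `lhs` and `rhs` agree up to rounding.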


4 Augmented index sets 

We now consider the case where we supplement the original (finite or infinite) index set $X$ with a finite set of elements $I$, intending to use them as inducing points. The precise nature of the augmented prior model will be parameterized by some parameters $\theta$ which we will hope to tune to give a good approximation. It will be seen that this is very much in the spirit of the original augmentation argument given by Titsias (2009a) and the ‘variational compression’ framework of Hensman and Lawrence (2014). This setup also covers the case of variational ‘interdomain’ Gaussian processes which were mooted but not implemented in Figueiras-Vidal and Lazaro-Gredilla (2009) and implemented on the basis of the marginal consistency argument in Alvarez et al. (2011). We intend to discuss the marginal consistency argument in some detail and we shall deal with the thorny issues surrounding the rigorous treatment of the various infinities involved.


Marginal consistency is easily ensured by specifying the distribution of the augmented function value points $f_I$ conditioned on the values of the function on the original set $f_X$. We denote the corresponding measure as $P_{I|X}(\cdot\,; \theta)$ [footnote 1]. Let $\Omega_X = \mathbb{R}^X$ and $\Omega_I = \mathbb{R}^I$ be the sample spaces associated with the original index set and the augmenting variables respectively. Let $\mathcal{F}_X$ and $\mathcal{F}_I$ be their $\sigma$-algebras. Marginal consistency states that we will be interested in probability measures that have the following behaviour on the measurable rectangles $A_X \times A_I \in \mathcal{F}_X \times \mathcal{F}_I$:

$$P_{X \cup I}(A_X \times A_I; \theta) = \int_{A_X} P_{I|X}(A_I; \theta)\,\mathrm{d}P_X(f_X). \tag{16}$$

We have included the augmentation parameters $\theta$ explicitly up until now, but for brevity we will omit them in what follows. We will make this marginal consistency assumption in all that follows. Let us call the overall set $X \cup I$ the ‘union set’. In a similar vein to the previous section we assume that the approximating measure $Q_{X \cup I}$ has density with respect to the augmented prior model $P_{X \cup I}$ and that the Radon-Nikodym derivative is only a function of the augmented function points:

$$\frac{\mathrm{d}Q_{X \cup I}}{\mathrm{d}P_{X \cup I}}(f_{X \cup I}) = \frac{\mathrm{d}Q_I}{\mathrm{d}P_I}(\pi_I(f_{X \cup I})). \tag{17}$$

Acting as if the augmented set were the original index set we would obtain by a similar argument:

$$\mathcal{KL}[Q_{X \cup I} \,\|\, \hat{P}_{X \cup I}] = \mathcal{KL}[Q_I \,\|\, P_I] - \mathbb{E}_{Q_D}[\log L(Y \mid f_D)] + \log L(Y). \tag{18}$$

Sharp eyed readers, however, will have noted that since $\hat{P}_{X \cup I}$ depends on the augmentation parameters $\theta$ we are back in a situation where we can tune the approximation on the left hand side and the optimization target on the right. As we will see in the next section we are not necessarily rescued by the marginal consistency argument. It is not the case in general that $\mathcal{KL}[Q_X \,\|\, \hat{P}_X]$ equals $\mathcal{KL}[Q_{X \cup I} \,\|\, \hat{P}_{X \cup I}]$. In fact the relationship is governed by the chain rule for $\mathcal{KL}$-divergences as we shall now see.

[Footnote 1] Note that for brevity our notation for conditional measures won’t include the explicit function dependence. For example, in this case we omit the explicit dependence on $f_X$.

On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes


4.1 The chain rule for $\mathcal{KL}$-divergences

For what follows we will require the chain rule for $\mathcal{KL}$-divergences (Gray, 2011). Let $U$ and $V$ be two Polish spaces endowed with their standard Borel $\sigma$-algebras and let $U \times V$ be the Cartesian product of these spaces endowed with the corresponding product $\sigma$-algebra. Consider two probability measures $\mu_{U \times V}$, $\eta_{U \times V}$ on this product space and let $\mu_{U|V}$, $\eta_{U|V}$ be the corresponding regular conditional measures. Assume that $\mu_{U \times V}$ is dominated by $\eta_{U \times V}$. The chain rule for $\mathcal{KL}$-divergences says that:

$$\mathcal{KL}[\mu_{U \times V} \,\|\, \eta_{U \times V}] = \mathbb{E}_{\mu_V}\{\mathcal{KL}[\mu_{U|V} \,\|\, \eta_{U|V}]\} + \mathcal{KL}[\mu_V \,\|\, \eta_V]. \tag{19}$$

The first term on the right hand side is referred to as the ‘conditional $\mathcal{KL}$-divergence’ or ‘conditional relative entropy’.

4.2 The marginally consistent augmentation argument is not correct in general

Applying the chain rule for $\mathcal{KL}$-divergences to the divergence on the union set we obtain:

$$\mathcal{KL}[Q_{X \cup I} \,\|\, \hat{P}_{X \cup I}] = \mathbb{E}_{Q_X}\{\mathcal{KL}[Q_{I|X} \,\|\, \hat{P}_{I|X}]\} + \mathcal{KL}[Q_X \,\|\, \hat{P}_X] = \mathbb{E}_{Q_X}\{\mathcal{KL}[Q_{I|X} \,\|\, P_{I|X}]\} + \mathcal{KL}[Q_X \,\|\, \hat{P}_X]. \tag{20}$$

The final line follows from the fact that in the assumed model augmentation scheme the additional variables $f_I$ are conditionally independent of the data given $f_X$. This relation makes precise our claim that marginal consistency is not enough to guarantee that $\mathcal{KL}[Q_X \,\|\, \hat{P}_X]$ equals $\mathcal{KL}[Q_{X \cup I} \,\|\, \hat{P}_{X \cup I}]$. In fact this will only be true if $Q_{I|X} = P_{I|X}$, $Q_X$-almost surely. In the case where this is not true variational inference in the family of augmented models is not equivalent to variational inference in the original model and we will be optimizing a ‘two-sided’ objective function. We will consider an important condition which ensures the desired equality does hold in the next section.

Before we move on, however, it is also instructive to consider a transformation of the original unaugmented problem into the augmented problem. Take the transformed augmentation set and index set $(I, X)$ to be defined in terms of the old sets as $(X \setminus D, D)$. The chain rule then tells us that the $\mathcal{KL}$-divergence on the data set is not in general equal to the $\mathcal{KL}$-divergence on the index set although this is true if $Z \subseteq D$.

4.3 Deterministic augmentation

Here we discuss an important case where the augmented $\mathcal{KL}$-divergence and the unaugmented $\mathcal{KL}$-divergence are indeed equal, namely where the additional variables $f_I$ are a deterministic function $h$ of the function values on the original index set $f_X$. A few conceptual points may be useful before we go into the detail. First the constraint only says that the values are deterministic conditioned on the function over the whole index set and the index set itself may be infinite. Usually in practice either through noise, finite observations or both, we can’t know the latent function exactly and hence in our model we won’t know the inducing variables exactly. Second, whilst this assumption may initially seem contrived, in fact it covers two very important cases: the original framework where some inducing points are selected from the index set $X$ then ‘copied’ over to $I$ and, as we shall see later, the interdomain inducing point framework.

Having a deterministic function mapping is equivalent to having a delta function conditional distribution centred on the function value. Thus the conditional $\mathcal{KL}$-divergence term in equation (20), i.e. the expectation of the conditional on the right hand side, will be zero if the approximating measure $Q_{X \cup I}$ has the same delta function conditional. The next theorem shows that if we follow the usual prescription for defining $Q_{X \cup I}$ this will indeed be the case.
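Both the chain rule decomposition and the special role of deterministic augmentation can be illustrated on a finite sample space. In the sketch below $f_X$ is abstracted to a single variable with four states and $f_I$ to a binary variable; the map `h`, the probability tables and the ‘noisy’ augmentation used for contrast are hypothetical choices of ours, and the approximation is built from the prior conditional of $f_X$ given $f_I$ together with a free marginal on $f_I$.

```python
import math

X_STATES = (0, 1, 2, 3)   # abstract values of f_X
I_STATES = (0, 1)         # values of the augmenting variable f_I


def h(fx):
    """Hypothetical deterministic map from f_X to f_I."""
    return fx % 2


prior_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
lik = {0: 0.9, 1: 0.5, 2: 0.2, 3: 0.6}   # likelihood depends on f_X only
q_i = {0: 0.7, 1: 0.3}                   # free variational marginal on f_I


def kl(a, b):
    """KL[a || b] for two tables defined on the same keys."""
    return sum(p * math.log(p / b[k]) for k, p in a.items() if p > 0)


def augmented_kls(cond_i_given_x):
    """Return (KL on union set, KL on original set, conditional KL term)."""
    # Marginally consistent augmented prior, as in equation (16).
    prior = {(x, i): prior_x[x] * cond_i_given_x(i, x)
             for x in X_STATES for i in I_STATES}
    prior_i = {i: sum(prior[(x, i)] for x in X_STATES) for i in I_STATES}
    # Approximation: prior conditional of f_X given f_I times q_i.
    q = {k: p / prior_i[k[1]] * q_i[k[1]] for k, p in prior.items()}
    # Augmented posterior: the likelihood only sees f_X.
    evidence = sum(prior_x[x] * lik[x] for x in X_STATES)
    post = {k: p * lik[k[0]] / evidence for k, p in prior.items()}
    q_x = {x: sum(q[(x, i)] for i in I_STATES) for x in X_STATES}
    post_x = {x: prior_x[x] * lik[x] / evidence for x in X_STATES}
    # Conditional KL term of the chain rule, equation (20).
    cond = 0.0
    for (x, i), p in q.items():
        if p > 0:
            cond += p * math.log((p / q_x[x]) / cond_i_given_x(i, x))
    return kl(q, post), kl(q_x, post_x), cond


def deterministic(i, x):
    return 1.0 if i == h(x) else 0.0


def noisy(i, x):
    return 0.8 if i == h(x) else 0.2


kl_union_det, kl_x_det, cond_det = augmented_kls(deterministic)
kl_union_noisy, kl_x_noisy, cond_noisy = augmented_kls(noisy)
```

With the deterministic augmentation the conditional term vanishes and the two divergences coincide. With the noisy augmentation, which is still marginally consistent, the conditional term is strictly positive in this example, so the divergence on the union set differs from the divergence on the original set.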

4.3.1 The governing theorem on deterministic augmentation

Let $(\Omega_X, \mathcal{F}_X)$ and $(\Omega_I, \mathcal{F}_I)$ be two Polish spaces and let $(\Omega_X \times \Omega_I, \mathcal{F}_X \times \mathcal{F}_I)$ be their product space endowed with the product $\sigma$-algebra. Let $h : \Omega_X \to \Omega_I$ be an $\mathcal{F}_X / \mathcal{F}_I$ measurable function. We are interested in a measure $P : \mathcal{F}_X \times \mathcal{F}_I \to \mathbb{R}$ which has the following property on the measurable rectangles $A_X \times A_I$:

$$P(A_X \times A_I) = P_X(A_X \cap h^{-1}(A_I)) \tag{21}$$

where $P_X(A_X) := P(A_X \times \Omega_I)$ is the marginal distribution for $X$. This assumption in turn implies that the marginal distribution for $I$ has the form

$$P_I(A_I) = P_X(h^{-1}(A_I)) \tag{22}$$

which is the push forward measure of $P_X$ under the function $h$. It is clear that the regular conditional distribution $P_{I|X}(\cdot)$ has a point measure property:

$$P_{I|X}(A_I) = \delta_{h(f_X)}(A_I). \tag{23}$$

Let $P_{X|I}(\cdot)$ be the regular conditional distribution of $f_X$ conditioned on $f_I$. Next we define a second measure $Q : \mathcal{F}_X \times \mathcal{F}_I \to \mathbb{R}$ which has the following property on measurable rectangles:

$$Q(A_X \times A_I) = \int_{A_I} P_{X|I}(A_X)\,\mathrm{d}Q_I(f_I). \tag{24}$$

Finally we assume that $Q_I \ll P_I$. The theorem states that under the assumptions of the previous section the marginal distributions of $Q$ have the following property:

$$Q_I(A_I) = Q_X(h^{-1}(A_I)). \tag{25}$$

That is to say the marginal distribution of $Q$ for $I$ is the push forward measure of $Q_X$ under the function $h$. Consequently the approximating distribution for $f_I$ conditioned on $f_X$ also has the point measure property

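$$Q_{I|X}(A_I) = \delta_{h(f_X)}(A_I). \tag{26}$$

We now give a proof. Starting from the right hand side of equation (25):

$$Q_X(h^{-1}(A_I)) = Q(h^{-1}(A_I) \times \Omega_I) = \int_{\Omega_I} P_{X|I}(h^{-1}(A_I))\,\mathrm{d}Q_I(f_I). \tag{27}$$

Next since $Q_I \ll P_I$ we apply the Radon-Nikodym theorem:

$$\int_{\Omega_I} P_{X|I}(h^{-1}(A_I))\,\mathrm{d}Q_I(f_I) = \int_{\Omega_I} P_{X|I}(h^{-1}(A_I))\,\frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P_I(f_I). \tag{28}$$

The existence of conditional distributions is also guaranteed by the Radon-Nikodym theorem. Explicitly we have

$$P_{X|I}(A_X) = \frac{\mathrm{d}P(A_X \times \cdot)}{\mathrm{d}P_I}. \tag{29}$$

Continuing on from equation (28) and applying an elementary theorem of Radon-Nikodym derivatives we have:

$$\int_{\Omega_I} P_{X|I}(h^{-1}(A_I))\,\frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P_I(f_I) = \int_{\Omega_I} \frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P(h^{-1}(A_I) \times f_I). \tag{30}$$

Now we apply the property given by equation (21):

$$= \int_{\Omega_I} \frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P_X(h^{-1}(A_I) \cap h^{-1}(f_I)). \tag{31}$$

Now we apply some algebraic manipulations of the integral:

$$\int_{\Omega_I} \frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P_X(h^{-1}(A_I) \cap h^{-1}(f_I)) = \int_{A_I} \frac{\mathrm{d}Q_I}{\mathrm{d}P_I}\,\mathrm{d}P_I(f_I) = Q_I(A_I) \tag{32}$$

as was claimed.

5 Examples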


5.1 Variational interdomain approximations 

Here we consider the sparse variational interdomain approximation which was suggested but not realized in Figueiras-Vidal and Lazaro-Gredilla (2009) and appeared on the basis of the marginal consistency argument in Alvarez et al. (2011). An interdomain variable is a random variable, indexed by $i \in I$, defined in the following way:

$$f_i := \int_X g_i(x, \theta)\, f_x \,\mathrm{d}\lambda(x). \tag{33}$$

Here $\lambda$ is a measure on $X$ with some appropriate $\sigma$-algebra, and $\{g_i : i \in I\}$ is a set of $\lambda$-integrable functions from $X$ to $\mathbb{R}$. The interdomain variables may be viewed as deterministic conditional on the whole function $f_X$ so the theorems of section 4.3 come into play. Since the intention here is to put this framework on a firm logical footing, we should also consider the thorny issue of the measurability of this transformation and the associated random variable. The existence of separable measurable versions of stochastic processes including most commonly used Gaussian processes was settled in the work of Doob (1953). That work also discusses the conditions necessary to apply Fubini’s theorem to expectations of the random variable defined by equation (33). The application of Fubini’s theorem is essential to the utility of such methods in practice (Figueiras-Vidal and Lazaro-Gredilla, 2009). Thus we may correctly optimize the parameters $\theta$ of interdomain inducing points safe in the knowledge that this decision is variationally protected from overfitting and optimizes a well defined $\mathcal{KL}$-divergence objective. The potential for a wide variety of improved sparse approximations in this direction is thus, in our opinion, significant.
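To make the role of Fubini’s theorem concrete, consider the cross covariance between an interdomain variable and a function value. For a zero mean process, exchanging the expectation and the integral gives $\mathrm{Cov}(f_i, f_{x'}) = \int_X g_i(x)\, k(x, x')\,\mathrm{d}\lambda(x)$. The sketch below assumes, purely for illustration, a one dimensional input, Lebesgue measure for $\lambda$, a squared exponential kernel and a normalized Gaussian window for $g_i$; none of these choices is prescribed by the text above. It compares trapezoidal quadrature with the closed form of the resulting Gaussian integral.

```python
import math


def kernel(x, x_prime, lengthscale=1.0):
    """Squared exponential kernel k(x, x')."""
    return math.exp(-0.5 * ((x - x_prime) / lengthscale) ** 2)


def window(x, centre, width):
    """Hypothetical Gaussian window g_i(x), normalized to integrate to one."""
    norm = width * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((x - centre) / width) ** 2) / norm


def cross_covariance_quadrature(x_prime, centre, width, lengthscale=1.0, n=4001):
    """Integral of g_i(x) k(x, x') dx by the trapezoidal rule."""
    lo, hi = centre - 10.0 * width, centre + 10.0 * width
    step = (hi - lo) / (n - 1)
    total = 0.0
    for j in range(n):
        x = lo + j * step
        weight = 0.5 if j in (0, n - 1) else 1.0
        total += weight * window(x, centre, width) * kernel(x, x_prime, lengthscale)
    return total * step


def cross_covariance_closed_form(x_prime, centre, width, lengthscale=1.0):
    """Analytic value of the same Gaussian integral."""
    s2 = lengthscale ** 2 + width ** 2
    scale = math.sqrt(lengthscale ** 2 / s2)
    return scale * math.exp(-0.5 * (x_prime - centre) ** 2 / s2)


numeric = cross_covariance_quadrature(x_prime=0.7, centre=0.2, width=0.5)
analytic = cross_covariance_closed_form(x_prime=0.7, centre=0.2, width=0.5)
```

The window parameters `centre` and `width` play the role of the augmentation parameters $\theta$ in this sketch: changing them changes which linear functional of the latent function is monitored.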


5.2 Approximations to Cox process posteriors 


In this section we relax the assumption that the data set $D$ is finite, which is necessary to consider Gaussian process based Cox processes. One specific case of this model is considered by Lloyd et al. (2015) under the marginal consistency motivation. A Gaussian process based Cox process has the following generative scheme:

$$f \sim \mathcal{GP}(m, K), \qquad h = \rho(f), \qquad Y \mid h \sim \mathcal{PP}(h). \tag{34}$$

Here $\mathcal{GP}(m, K)$ denotes a Gaussian process with mean $m$ and kernel $K$, $\rho : \mathbb{R} \to (0, \infty)$ is an inverse link function, $\mathcal{PP}(h)$ is a Poisson process with intensity $h$ and $D$ is a set of points in the original index set $X$. For example in a geographical spatial statistics application we might take $X$ to be some bounded subset of $\mathbb{R}^2$. The key issue with the Poisson process likelihood is that it depends not just on those points in $X$ where points were observed but in fact on all points in $X$. Intuitively the absence of points in an area suggests that the intensity is lower there. Thus $D = X$. The likelihood in question is:

$$L(Y \mid f_D) = \left(\prod_{y \in Y} \rho(f_y)\right) \exp\left\{-\int_X \rho(f_x)\,\mathrm{d}m(x)\right\} \tag{35}$$

where $m$ denotes for instance Lebesgue measure on $X$. The full $X$ dependence manifests itself through the integral on the right hand side. We will require that the integral exists almost surely. In Lloyd et al. (2015) equation (3), the application of Bayes’ theorem appears to require a density with respect to infinite dimensional Lebesgue measure. As pointed out in section 3.1 such a notion is pathological. This however can be fixed because the more general form of Bayes’ theorem in equation (9) of this paper still applies. Thus we can apply the results of section 3.3 to obtain:

$$\mathcal{KL}[Q \,\|\, \hat{P}] = \mathcal{KL}[Q_Z \,\|\, P_Z] - \sum_{y \in Y} \mathbb{E}_{Q_y}[\log \rho(f_y)] + \mathbb{E}_{Q_X}\left[\int_X \rho(f_x)\,\mathrm{d}m(x)\right] + \log L(Y). \tag{36}$$

As in section 5.1 we will need to check that the conditions for Fubini’s theorem apply (Doob, 1953) which gives:

$$\mathcal{KL}[Q \,\|\, \hat{P}] = \mathcal{KL}[Q_Z \,\|\, P_Z] - \sum_{y \in Y} \mathbb{E}_{Q_y}[\log \rho(f_y)] + \int_X \mathbb{E}_{Q_x}[\rho(f_x)]\,\mathrm{d}m(x) + \log L(Y). \tag{37}$$

For the specific case of $\rho$ used in Lloyd et al. (2015) the working then continues as in that paper and the elegant results that follow all still apply. Note that one could combine these Cox process approximations with the interdomain framework and this could be a fruitful direction for further work.

6 Conclusion and acknowledgements


In this work we have elucidated the connection between the variational inducing point framework (Titsias, 2009a) and a rigorously defined $\mathcal{KL}$-divergence between stochastic processes. An early application of the rigorous formulation of the $\mathcal{KL}$-divergence in the Gaussian processes for machine learning literature was made by Seeger (2003a; 2003b). Here we have increased the domain of applicability of those proofs by allowing for inducing points that are not data points, and removing unnecessary dependence on Lebesgue measure. We would argue that our proof clarifies the central and elegant role played by Radon-Nikodym derivatives. We then consider for the first time in this framework the case where additional variables are added solely for the purpose of variational inference. We show that marginal consistency is not enough to guarantee a principled optimization objective but that if we make the inducing points deterministic conditional on the whole function then a principled optimization objective is guaranteed and the parameters of the augmentation are variationally protected. We then show how the extended theory allows us to correctly handle principled interdomain sparse approximations and that we can cope correctly with the important case of Cox processes where the likelihood depends on an infinite set of function points.


It seems reasonable to hope that elucidating the measure theoretic roots of the formulation will help the community to generalise the framework and lead to even better practical results. In particular it seems that since interdomain inducing points are linear functionals, the theory of Hilbert spaces might profitably be applied here. It also seems reasonable to think given the generality of section 3.3 that other Bayesian and Bayesian nonparametric models might be amenable to such a treatment.